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Abstract 

This paper introduces the Metric-Free Natural Gradient (MFNG) algorithm for 
training Boltzmann Machines. Similar in spirit to the Hessian-Free method of 
Martens [8 1, our algorithm belongs to the family of truncated Newton methods and 
exploits an efficient matrix-vector product to avoid explicitly storing the natural 
gradient metric L. This metric is shown to be the expected second derivative 
of the log-partition function (under the model distribution), or equivalently, the 
covariance of the vector of partial derivatives of the energy function. We evaluate 
our method on the task of joint- training a 3-layer Deep Boltzmann Machine and 
show that MFNG does indeed have faster per-epoch convergence compared to 
Stochastic Maximum Likelihood with centering, though wall-clock performance 
is currently not competitive. 



1 Introduction 

Boltzmann Machines (BM) have become a popular method in Deep Learning for performing fea- 
ture extraction and probability modeling. The emergence of these models as practical learning 
algorithms stems from the development of efficient training algorithms, which estimate the negative 
log-likelihood gradient by either contrastive |S1 or stochastic |[T8l[T9l approximations. However, the 
success of these models has for the most part been limited to the Restricted Boltzmann Machine 
(RBM) 1 6 1, whose architecture allows for efficient exact inference. Unfortunately, this comes at the 
cost of the model's representational capacity, which is limited to a single layer of latent variables. 
The Deep Boltzmann Machine (DBM) [15] addresses this by defining a joint energy function over 
multiple disjoint layers of latent variables, where interactions within a layer are prohibited. While 
this affords the model a rich inference scheme incorporating top-down feedback, it also makes train- 
ing much more difficult, requiring until recently an initial greedy layer-wise pretraining scheme. 
Since, Montavon and Muller [9| have shown that this difficulty stems from an ill-conditioning of 
the Hessian matrix, which can be addressed by a simple reparameterization of the DBM energy 
function, a trick called centering (an analogue to centering and skip-connections found in the de- 
terministic neural network literature ifTTl ffiln . As the barrier to joint- training Q is overcoming a 
challenging optimization problem, it is apparent that second-order gradient methods might prove to 
be more effective than simple stochastic gradient methods. This should prove especially important 
as we consider models with increasingly complex posteriors or higher-order interactions between 
latent variables. 

To this end, we explore the use of the Natural Gradient |2|, which seems ideally suited to the stochas- 
tic nature of Boltzmann Machines. Our paper is structured as follows. Section[2]provides a detailed 
derivation of the natural gradient, including its specific form for BMs. While most of these equations 

1 Joint-training refers to the act of jointly optimizing 8 (the concatenation of all model parameters, across all 
layers of the DBM) through maximum likelihood. This is in contrast to (15), where joint-training is preceded 
by a greedy layer-wise pretraining strategy. 
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have previously appeared in [3 1, our derivation aims to be more accessible as it attempts to derive the 
natural gradient from basic principles, while minimizing references to Information Geometry. Sec- 
tion [3]represents the true contribution of the paper: a practical natural gradient algorithm for BMs 
which exploits the persistent Markov chains of Stochastic Maximum Likelihood (SML) |18|, with 
a Hessian-Free (HF) like algorithm J8). The method, named Metric-Free Natural Gradient (MFNG) 
(in recognition of the similarities of our method to HF), avoids explicitly storing the natural gradient 
metric L and uses a linear solver to perform the required matrix-vector product L _1 E 9 [V log pel- 
Preliminary experimental results on DBMs are presented in Section|4] with the discussion appearing 
in Section|5] 



2 The Natural Gradient 
2.1 Motivation and Derivation 

The main insight behind the natural gradient is that the space of all probability distributions V = 
{pe(x); 9 E Q,x € \} forms a Riemannian manifold. Learning, which typically proceeds by 
iteratively adapting the parameters 9 to fit an empirical distribution q, thus traces out a path along 
this manifold. An immediate consequence is that following the direction of steepest descent in the 
original Euclidean parameter space does not correspond to the direction of steepest descent along V. 
To do so, one needs to account for the metric describing the local geometry of the manifold, which 
is given by the Fisher Information matrix [1 1, shown in Equation [4] While this metric is typically 
derived from Information Geometry, a derivation more accessible to a machine learning audience 
can be obtained as follows. 

The natural gradient aims to find the search direction A9 which minimizes a given objective func- 
tion, such that the Kullback-Leibler divergence KL(pg \\ pg+Ae) remains constant throughout 
optimization. This constraint ensures that we make constant progress regardless of the curvature 
of the manifold V and enforces an invariance to the parameterization of the model. The natural 
gradient for maximum likelihood can thus be formalized as: 

V N :=A9* <r- argmin Ae E q [-\ogp g+Ae (x)} (1) 
s.t. KL(p e || P8+Ae) = const. 

In order to derive a useful parameter update rule, we will consider the KL divergence under the 
assumption A9 — > 0. We also assume we have a discrete and bounded domain \ over which we 
define the probability mass functiorj^Jpg. Taking the Taylor series expansion of logpg + Ae around 9, 
and denoting V/ as the column vector of partial derivatives with as the i-th entry, and V 2 / the 

Hessian matrix with fl f L. in position (i, j), we have: 



KL{p e |j pg+Ae) ~ ^ pg log pe - 



X X 
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\ogpe + (V logpef A9 + ^A9 T (V 2 logp e ) A9 



A9 1 E pg [~\7'logp e \A9 (2) 



with the transition stemming from the fact that ^ Pe di Qg Pe — ^Z x( z x Po{x) — 0. Replacing 
the objective function of Equation [T] by its first-order Taylor expansion and rewriting the constraint 
as a Lagrangian, we arrive at the following formulation for C(9, A9), the loss function which the 
natural gradient seeks to minimize. 

£(0,A0) = E q [- log Pe ]+E q [-V log p e f A9 + ^ A9 T E Pe [-V 2 log Pe ] A9. 

Setting J^jj to zero yields the natural gradient direction V^: 

V N = L~ l E q [V logpe] with L = E Pl) [-V 2 \og Pe ] (3) 
or equivalently L = E Pe [V logpeV T logpe] (4) 



2 When clear from context, we will drop the argument of pe to save space. 
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While its form is reminiscent of the Newton direction, the natural gradient multiplies the estimated 
gradient by the inverse of the expected Hessian of log pg (Equation [3]) or equivalently by the Fisher 
Information matrix (FIM, Equation [4]). The equivalence between both expressions can be shown 
trivially, with the details appearing in the Appendix. We stress that both of these expectations 
are computed with respect to the model distribution, and thus computing the metric L does not 
involve the empirical distribution in any way. The FIM for Boltzmann Machines is thus not equal 
to the uncentered covariance of the maximum likelihood gradients. In the following, we pursue our 
derivation from the form given in Equation [4] 

2.2 Natural Gradient for Boltzmann Machines 

Derivation. Boltzmann machines define a joint distribution over a vector of binary random vari- 
ables x € {0, 1} N by way of an energy function E(x) = — J2k<i WkiXkXi — ^2 k bkXk, with 
weight matrix W <G M ArxJV and bias vector b <E M. N . Energy and probability are related by the 
Boltzmann distribution, such that p(x) = i cxp (— E(x)), with Z the partition function defined by 

Z = J2 x exp(-E(x)). 

Starting from the expression of L found in Equation[3] we can derive the natural gradient metric for 
Boltzmann Machines. 

L<- mr > = E pe [V 2 E{x) + V 2 log Z] = E pe [V 2 log Z] 



The natural gradient metric for first-order BMs takes on a surprisingly simple form: it is the expected 
Hessian of the log-partition function. With a few lines of algebra (whose details are presented in the 
Appendix), we can rewrite it as follows: 



L< BM ) = E pe \(VE(x) - E po [VE(x)]f (VE(x) - E pe [VE(x)]) 



(5) 



L (BM) is thus 

given by the covariance of VE, measured under the model distribution p$. Concretely, 
if we denote Wki and W mn as the i and j-th parameters of the model respectively, the entry Lij will 
take on the value — E [xkXix m x n ] + E [xkXi] E [x m x n ]. 



Discussion. When computing the Taylor expansion of the KL divergence in Equation [2] we 
glossed over an important detail. Namely, how to handle latent variables in pg(x), a topic first dis- 
cussed in 0. If x = [v, h], we could just as easily have derived the natural gradient by considering 
the constraint KL (^2 h po(v,h) \\ J2hPs+^e(v,h)) = const. Alternatively, since the distinction 
between visible and hidden units is entirely artificial (since the KL divergence does not involve the 
empirical distribution), we may simply wish to consider the distribution obtained by analytically 
integrating out a maximal number of random variables. In a DBM, this would entail marginalizing 
over all odd or even layers, a strategy employed with great success in the context of AIS [ 15 1. In this 
work however, we only consider the metric obtained by considering the KL divergence between the 
full joint distributions pe and pe+Ae- 



3 Metric-Free Natural Gradient Implementation 

We can compute the natural gradient V^r by first replacing the expectations of Equation[5]by a finite 
sample approximation. We can do this efficiently by reusing the model samples generated by the 
persistent Markov chains of SML. Given the size of the matrix being estimated however, we expect 
this method to require a larger number of chains than is typically used. The rest of the method is 
similar to the Hessian-Free (HF) algorithm of Martens (8): we exploit an efficient matrix-vector 
implementation combined with a linear-solver, such as Conjugate Gradient or MinRes lfT2l . to solve 
the system Ly — E q [Vlogpg] for y £ K . Additionally, we replace the expectation on the rhs. 
of this previous equation by an average computed over a mini-batch of training examples (sampled 
from the empirical distribution q), as is typically done in the stochastic learning setting. 

For Boltzmann Machines, the matrix-vector product Ly can be computed in a straightforward man- 
ner, without recourse to Pearlmutter's R-operator 1131 . Starting from a sampling approximation to 
Equation|5] we simply push the dot product inside of the expectation as follows: 
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Algorithm 1 MFNG_iteration(0, X + , Z~ ld ) 

9: parameters of the model. N :=| 9 \. 

X + : mini-batch of training examples, with X^ 

Z~ ld : previous state of Markov chains, with Z 



= {x m ;me [l,Af]}. 
= {z m ■= {v m ,h8?,h$y,m€ [1,M]} 



• Generate positive phase samples Z + = {z+ := (x m , h m , hm ); m € [1, M]} 

• Initializing M Markov chains from state Z~ ld , generate negative phase samples Z~ ew . 

• Compute the vectors s+ = aE Q Z g m ^ and = 9E ^ g m \ Vrn. 

• Compute negative log-likelihood gradient as g = A J2 m ( s m — s m)- 

• Denote S <G R RIxN as the matrix with rows s~ and S = jj J2 m s m- 

• Solve the system "Ly — g" fory, given L = (S — S) T (S — S) and an initial zero vector. 

• computeLy is a function which perforins equation [6] without instantiating L. 

• Vat# <— CGSolve(computeLy, S, g, zeros(N)) 



LM y « (S-S) T [(S-S)y] (6) 
with S €R MxN , the matrix with entries s„ - dE ^ Xm ^> 



and 



S E K w , the vector with entries s, = — , 

J M ^ 



M 

m 

and x m ~ pe(x), m G [1, M]. 

By first computing the matrix-vector product (S — S)y, we can easily avoid computing the full 
N x N matrix L. Indeed, the result of this operation is a vector of length M, which is then left- 
multiplied by a matrix of dimension N x M, yielding the matrix-vector product Ly G M. N , A 
single iteration of the MFNG is presented in Algorithm [T] A full open-source implementation is 
also available online^ 



4 Experiments 

We performed a proof-of -concept experiment to determine whether our Metric-Free Natural Gra- 
dient (MFNG) algorithm is suitable for joint- training of complex Boltzmann Machines. To this 
end, we compared our method to Stochastic Maximum Likelihood and a diagonal approximation of 
MFNG on a 3-layer Deep Boltzmann Machine trained on MNIST JTJ. All algorithms were run in 
conjunction with the centering strategy of Montavon and Muller [9|, which proved crucial to suc- 
cessfully joint-train all layers of the DBM (even when using MFNG)0 We chose a small 3-layer 
DBM with 784-400-100 units at the first, second and third layers respectively, to be comparable to 
iflOl . Hyper-parameters were varied as follows. For inference, we ran 5 iterations of either mean- 
field as implemented in |fT31l or Gibbs sampling. The learning rate was kept fixed during training and 
chosen from the set {5 • 10~ 3 , 10~ 3 , 10~ 4 }. For MinRes, we set the damping coefficient to 0.1 and 
used a fixed tolerance of 10 -5 (used to determine convergence). Finally, we tested all algorithms 
on minibatch sizes of either 25, 128 or 256 elements Finally, since we are comparing optimiza- 
tion algorithms, hyper-parameters were chosen based on the training set likelihood (though we still 
report the associated test errors). All experiments used the MinRes linear solver, both for its speed 
and its ability to return pseudo-inverses when faced with ill-conditioning. 

3 https://github.com/gdesjardins/MFNG 

The centering coefficients were initialized as in 1 9 1, but were otherwise held fixed during training. 
5 We expect larger minibatch sizes to be preferable, however simulating this number of Markov chains in 
parallel (on top of all other memory requirements) was sufficient to hit the memory bottlenecks of GPUs. 
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Figure 1: Estimated model likelihood as a function of (left) epochs and (right) CPU-time for MFNG, 
its diagonal approximation (MFNG-diag) and SML. All methods were run in conjunction with the 
DBM centering trick [9|, with centering coefficients held fixed during training. Our grid search 
yielded the following hyper-parameters: batch size of 256/128 for MFNG(-diag)/SGD; 5 steps of 
mean-field / sampling-based inference for MFNG(-diag)/SGD and a learning rate of 5 ■ 10~ 3 . 



Figure [T] (left) shows the likelihood as estimated by Annealed Importance Sampling 03] QT) as a 
function of the number of epochs Under this metric, MFNG achieves the fastest convergence, 
obtaining a training/test set likelihood of —71.26/— 72.84 nats after 94 epochs. In comparison, 
MFNG-diag obtains -73.22/- 74.05 nats and SML -80.12/- 79.71 nats in 100 epochs. The picture 
changes however when plotting likelihood as a function of CPU-time, as shown in Figure [T] (right). 
Given a wall-time of 8000s for MFNG and SML, and 5000s for MFNG-dia^ 7 ] SML is able to 
perform upwards of 1550 epochs, resulting in an impressive likelihood score of —64.94 / —67.73. 
Note that these results were obtained on the binary-version of MNIST (thresholded at 0.5) in order 
to compare to ifTOll . These results are therefore not directly comparable to Q5J , which binarizes 
the dataset through sampling (by treating each pixel activation as the probability p of a Bernouilli 
distribution). 

Figure [2] shows a breakdown of the algorithm runtime, for various components of the algorithm. 
These statistics were collected in the early stages of training, but are generally representative of the 
bigger picture. While the linear solver clearly dominates the runtime, there are a few interesting 
observations to make. For small models and batch sizes greater than 256, a single evaluation of Ly 
appears to be of the same order of magnitude as a gradient evaluation. In all cases, this cost is smaller 
than that of sampling, which represents a non-negligible part of the total computational budget. 
This suggests that MFNG could become especially attractive for models which are expensive to 
sample from. Overall however, restricting the number of CG/MinRes iterations appears key to 
computational performance, which can be achieved by increasing the damping factor a. How this 
affects convergence in terms of likelihood is left for future work. 

5 Discussion and Future Work 

While the wall-clock performance of MFNG is not currently competitive with SML, we believe there 
are still many avenues to explore to improve computational efficiency. Firstly, we performed almost 
no optimization of the various MinRes hyper-parameters. In particular, we ran the algorithm to 
convergence with a fixed tolerance of 10~ 5 . While this typically resulted in relatively few iterations 
(around 15), this level of precision might not be required (especially given the stochastic nature of 

6 While we do not report error margins for AIS likelihood estimates, the numbers proved robust to changes in 
the number of particles and temperatures being simulated. To obtain such robust estimates, we implemented all 
the tricks described in Salakhutdinov and Hinton 1 15] and 1 16]: pa a zero-weight base-rate model whose biases 
are set by maximum likelihood; interpolating distributions pi oc p^'^^p^^, with ps the target distribution; 
and finally analytical integration of all odd-layers. 

7 This discrepancy will be resolved in the next revision. 



5 




Figure 2: Breakdown of algorithm runtime, when we vary (left) the batch size (with fixed model 
architecture 784 — 400 — 100) and (right) the model size (with fixed batch size of 256). Runtime 
is additive in the order given by the labels (top to bottom). Dotted lines denote intermediate steps, 
while continuous lines denote full steps. Data was collected on a Nvidia GTX 480 card. 



the algorithm). Additionally, it could be worth exploiting the same strategy as HF where the linear 
solver is initialized by the solution found in the previous iteration. This may prove much more 
efficient than the current approach of initializing the solver with a zero vector. Pre-conditioning 
is also a well-known method for accelerating the convergence speed of linear solvers [5|. Our 
implementation used a simple diagonal regularization of L. The Jacobi preconditioner could be 
implemented easily however by computing the diagonal of L in a first-pass. 

Finally, while our single experiment offers little evidence in support of either conclusion, it may very 
well be possible that MFNG is simply not computationally efficient for DBMs, compared to SML 
with centering. In this case, it would be worth applying the method to either (i) models with known 
ill-conditioning, such as factored 3-rd order Boltzmann Machines or (ii) models and distributions 
exhibiting complex posterior distributions. In such scenarios, we may wish to maximize the use 
of the positive phase statistics (which were obtained at a high computational cost) by performing 
larger jumps in parameter space. It remains to be seen how this would interact with SML, where the 
burn-in period of the persistent chains is directly tied to the magnitude of A9. 



Appendix 

We include the following derivations for completeness. 



5.1 Expected Hessian of log Z and Fisher Information. 
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5.2 Derivation of Equation [5] 
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